produces a vector of size K such that sum(beta) = 0. The unconstrained representation requires only K - 1 values because the last is determined by the first K - 1.
A sum-to-zero vector is exactly what the name suggests: a vector whose elements sum to 0. If you put a normal prior on a zero-sum vector, the resulting marginal variance will be less than the intended normal variance. To get the same variance as the intended normal prior, scale the prior's standard deviation:

beta ~ normal(0, sigma * sqrt(N * inv(N - 1)));

where sigma is the intended standard deviation. It is a bit more efficient to pre-calculate sqrt(N * inv(N - 1)) in the transformed data block. The general result for obtaining a given variance from a normal with linear constraints is in: Fraser, D. A. S. (1951). Normal Samples With Linear Constraints and Given Variances. Canadian Journal of Mathematics, 3, 363–366. doi:10.4153/CJM-1951-041-9.
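This scaling is easy to check numerically. The following NumPy sketch (illustrative, not part of the case study code) projects i.i.d. normal draws onto the zero-sum subspace, which yields the same marginal distribution as putting a normal prior on a sum-to-zero vector, and verifies both the shrunken variance and the corrected one:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, draws = 10, 2.0, 200_000

# i.i.d. normal draws, projected onto the zero-sum subspace by centering
x = rng.normal(0.0, sigma, size=(draws, N))
x_zero_sum = x - x.mean(axis=1, keepdims=True)

# marginal variance shrinks to sigma^2 * (N - 1) / N
print(x_zero_sum.var(axis=0).mean())  # ~ 3.6 for sigma = 2, N = 10

# scaling sigma by sqrt(N / (N - 1)) restores the intended variance
x_scaled = rng.normal(0.0, sigma * np.sqrt(N / (N - 1)), size=(draws, N))
x_scaled_zero_sum = x_scaled - x_scaled.mean(axis=1, keepdims=True)
print(x_scaled_zero_sum.var(axis=0).mean())  # ~ 4.0 = sigma^2
```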
Prior to Stan 2.36, a sum-to-zero constraint could be implemented in one of two ways:
As a “hard” sum-to-zero constraint, where the parameter is declared as an \(N-1\)-length vector with a corresponding \(N\)-length transformed parameter whose first \(N-1\) elements are the same as the parameter vector, and whose \(N^{th}\) element is the negative sum of the first \(N-1\) elements.
As a “soft” sum to zero constraint with an \(N\)-length parameter vector whose sum is constrained to be within \(\epsilon\) of \(0\).
Until Stan 2.36, users had to choose between the hard and soft sum-to-zero constraints, without clear guidance. As a general rule, for small vectors the hard sum-to-zero constraint is more efficient, while for larger vectors the soft sum-to-zero constraint is faster, but much depends on the specifics of the model and the data.
For small \(N\) and models with sensible priors, the hard sum-to-zero constraint is usually satisfactory. But as the size of the vector grows, it distorts the marginal variance of the \(N^{th}\) element. Given a parameter vector: \[
x_1, x_2, \dots, x_{N-1} \sim \text{i.i.d. } N(0, \sigma^2)
\] by the properties of independent normal variables, each of the free elements \(x_1, \ldots, x_{N-1}\) has variance \(\sigma^2\). However, the \(N^{th}\) element is defined deterministically as: \[
x_N = -\sum_{i=1}^{N-1} x_i
\] and its variance is inflated by a factor of \(N-1\). \[
\operatorname{Var}(x_N) = \operatorname{Var}\Bigl(-\sum_{i=1}^{N-1} x_i\Bigr)
= \sum_{i=1}^{N-1} \operatorname{Var}(x_i)
= (N-1)\sigma^2.
\] For large vectors, MCMC samplers struggle with the hard sum-to-zero constraint, as every change to any of the \(N-1\) elements also requires a corresponding change to the \(N^{th}\) element; balancing these changes introduces potential non-identifiabilities.
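This variance inflation is straightforward to verify by simulation. A small NumPy sketch (illustrative, not from the case study code) builds the hard constraint directly:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma, draws = 10, 1.0, 200_000

# hard constraint: N - 1 free elements, last element is their negative sum
x_free = rng.normal(0.0, sigma, size=(draws, N - 1))
x_last = -x_free.sum(axis=1)

print(x_free.var(axis=0).mean())  # ~ sigma^2 = 1.0
print(x_last.var())               # ~ (N - 1) * sigma^2 = 9.0
```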
The soft sum-to-zero constraint is problematic for the following reasons.
The tolerance \(\epsilon\) (the scale of the penalty) must be chosen by the analyst. If it is too large, the sum is too far from zero to be effective; if it is too small, the sampler cannot satisfy the constraint.
The soft constraint only penalizes deviations from zero, leading to weaker identifiability of the parameters. This can lead to slow convergence and mixing, as the sampler explores nearly non-identified regions.
The marginal variances may not reflect the intended prior.
The sum_to_zero_vector transform ensures that each element of the resulting constrained vector has the same variance. This improves the sampler performance, providing fast computation and good effective sample size. This becomes increasingly noticeable as models increase in size and complexity. To demonstrate this, in this notebook we consider two different classes of models:
Multi-level regressions for binomial data with group-level categorical predictors
Spatial models for areal data
Multi-level Models with Group-level Categorical Predictors
In this section we consider a model which estimates per-demographic disease prevalence rates for a population.
The data consists of:
A set of per-demographic aggregated outcomes of a diagnostic test procedure with unequal number of tests per demographic.
A corresponding set of demographic descriptors encoded as a vector of categorical values. In this example these are named sex, age, eth, and edu, but there can be any number of demographic predictors with any number of categories.
The specified test sensitivity and specificity
In order to fit this model, we need to put a sum-to-zero constraint on the categorical variables.
In the following sections we set up a data-generating program and generate two test datasets, one large, one small. Then we run all three implementations of the sum-to-zero constraint on these datasets and compare the estimates of the group-level regression parameters. This provides a small demonstration of the behavior of these three implementations under different conditions. A more thorough investigation would include more and different datasets, test procedures, and visualizations.
To investigate the predictive behavior of this model at different timepoints in a pandemic, we have written a data-generating program to create datasets given the baseline disease prevalence, test specificity and sensitivity, and the specified total number of diagnostic tests.
The full model is in file stan/binomial_4_preds.stan. It provides an estimate of the true prevalence based on binary tests with a given (or unknown) test sensitivity and specificity as follows.
transformed parameters {
  // true prevalence
  vector[N] p = inv_logit(beta_0 + beta_sex * sex_c + beta_age[age] + beta_eth[eth] + beta_edu[edu]);
  // incorporate test sensitivity and specificity
  vector[N] p_sample = p * sens + (1 - p) * (1 - spec);
}
model {
  pos_tests ~ binomial(tests, p_sample);  // likelihood
  ...
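The sensitivity/specificity adjustment in p_sample can be illustrated outside of Stan. The values below are made up for illustration and are not the study's estimates:

```python
import numpy as np

# illustrative values (not the study's estimates)
sens, spec = 0.75, 0.995
p = np.array([0.01, 0.05, 0.10])  # true prevalence

# probability of a positive test: true positives plus false positives
p_sample = p * sens + (1 - p) * (1 - spec)
print(p_sample)  # [0.01245 0.04225 0.0795 ]
```

Note that at low true prevalence, false positives from imperfect specificity make up a large fraction of the positive tests, which is why the model must account for them.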
To constrain the group-level parameters age, eth, and edu, we use the sum_to_zero_vector.
In order to put a standard normal prior on beta_age, beta_eth, and beta_edu, we need to scale the variance, as suggested above. The scaling factors are pre-computed in the transformed data block, and applied as part of the prior.
In the generated quantities block we use Stan’s PRNG functions to populate the true weights for the categorical coefficient vectors, and the relative percentages of per-category observations. Then we use a set of nested loops to generate the data for each demographic, using the PRNG equivalent of the model likelihood.
The full data-generating program is in file stan/gen_binomial_4_preds.stan. Here we show the nested loop which generates the modeled and unmodeled data inputs.
Because the modeled data pos_tests is generated according to the Stan model’s likelihood, the model is a priori well-specified with respect to the data. Because the true parameters are defined in the generated quantities block, each sample provides a dataset from a different set of regression covariates and with different amounts of per-demographic data.
Creating Simulated Datasets
The data generating program allows us to create datasets for large and small populations and for finer or more coarse-grained sets of categories. The larger the number of strata overall, the more observations are needed to get good coverage.
We generate two datasets: one with a small number of observations, relative to the number of strata, and one with sufficient data to provide information on all combinations of demographics.
Plot the distribution of observed positive tests and the underlying prevalence.
Because the data-generating parameters and percentage of observations per category are generated at random, some datasets may have very low overall disease rates and/or many unobserved strata, and will therefore be pathologically hard to fit. This is informative for understanding what is consistent when generating a set of percentages and regression weights as is done in the Stan data generating program.
Assemble the data dictionary of all input data for the model which solves the inverse problem, i.e., estimates regression coefficients given the observed data. We use the generated data as the inputs. Because the output files contain real-valued numbers regardless of the variable's declared element type, model data variables of type int need to be cast back to int. Here all the observed data is count and categorical data.
binomial_ozs_fit = binomial_ozs_mod.sample(data=data_small, parallel_chains=4)
print("sum_to_zero_vector: runtimes in seconds, all chains")
binomial_ozs_fit.time
10:54:55 - cmdstanpy - INFO - CmdStan start processing
10:54:56 - cmdstanpy - INFO - CmdStan done processing.
10:54:56 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 57, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Consider re-running with show_console=True if the above output is unclear!
sum_to_zero_vector: runtimes in seconds, all chains
binomial_ozs_fit_lg = binomial_ozs_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
print("sum_to_zero_vector, large: runtimes in seconds, all chains")
binomial_ozs_fit_lg.time
10:54:56 - cmdstanpy - INFO - CmdStan start processing
10:54:56 - cmdstanpy - INFO - CmdStan done processing.
10:54:56 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 57, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Consider re-running with show_console=True if the above output is unclear!
sum_to_zero_vector, large: runtimes in seconds, all chains
binomial_hard_fit = binomial_hard_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
print("hard sum-to-zero: runtimes in seconds, all chains")
binomial_hard_fit.time
10:54:57 - cmdstanpy - INFO - CmdStan start processing
10:54:57 - cmdstanpy - INFO - CmdStan done processing.
10:54:57 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 46, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 44, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 44, column 2 to column 38)
Consider re-running with show_console=True if the above output is unclear!
binomial_hard_fit_lg = binomial_hard_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
print("hard sum-to-zero, large: runtimes in seconds, all chains")
binomial_hard_fit_lg.time
10:54:57 - cmdstanpy - INFO - CmdStan start processing
10:54:58 - cmdstanpy - INFO - CmdStan done processing.
10:54:58 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 46, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 44, column 2 to column 38)
Consider re-running with show_console=True if the above output is unclear!
hard sum-to-zero, large: runtimes in seconds, all chains
binomial_soft_fit = binomial_soft_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
print("soft sum-to-zero: runtimes in seconds, all chains")
binomial_soft_fit.time
10:54:58 - cmdstanpy - INFO - CmdStan start processing
10:55:11 - cmdstanpy - INFO - CmdStan done processing.
10:55:11 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 46, column 2 to column 34)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 44, column 2 to column 34)
Consider re-running with show_console=True if the above output is unclear!
binomial_soft_fit_lg = binomial_soft_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
print("soft sum-to-zero, large: runtimes in seconds, all chains")
binomial_soft_fit_lg.time
10:55:11 - cmdstanpy - INFO - CmdStan start processing
10:55:15 - cmdstanpy - INFO - CmdStan done processing.
10:55:15 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 46, column 2 to column 34)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 44, column 2 to column 34)
Consider re-running with show_console=True if the above output is unclear!
soft sum-to-zero, large: runtimes in seconds, all chains
In the small-data regime, the soft sum-to-zero constraint takes considerably more wall-clock time to fit the data. On Apple M3 hardware, all three models quickly fit the large dataset.
Model Checking and Comparison
Check convergence
We check the R-hat and effective sample size (ESS) for all group-level parameters.
In order to do this multiway comparison, we assemble the individual summaries from the 6 runs above into two dataframes. We also compute the number of effective samples per second, a key metric of model efficiency. (Note: we’re using a development version of CmdStanPy to scrape time information from the CSV files, because changes to CmdStan’s stansummary function removed the ESS/sec metric. This is a workaround for now.)
# small dataset
ozs_fit_summary = binomial_ozs_fit.summary(sig_figs=2)
ozs_fit_summary.index = ozs_fit_summary.index.astype(str) + " a) ozs"
ozs_fit_time = binomial_ozs_fit.time
ozs_total_time = 0
for i in range(len(ozs_fit_time)):
    ozs_total_time += ozs_fit_time[i]['total']
ozs_fit_summary['ESS/sec'] = ozs_fit_summary['ESS_bulk'] / ozs_total_time

hard_fit_summary = binomial_hard_fit.summary(sig_figs=2)
hard_fit_summary.index = hard_fit_summary.index.astype(str) + " b) hard"
hard_fit_time = binomial_hard_fit.time
hard_total_time = 0
for i in range(len(hard_fit_time)):
    hard_total_time += hard_fit_time[i]['total']
hard_fit_summary['ESS/sec'] = hard_fit_summary['ESS_bulk'] / hard_total_time

soft_fit_summary = binomial_soft_fit.summary(sig_figs=2)
soft_fit_summary.index = soft_fit_summary.index.astype(str) + " c) soft"
soft_fit_time = binomial_soft_fit.time
soft_total_time = 0
for i in range(len(soft_fit_time)):
    soft_total_time += soft_fit_time[i]['total']
soft_fit_summary['ESS/sec'] = soft_fit_summary['ESS_bulk'] / soft_total_time

small_data_fits_summary = pd.concat([ozs_fit_summary, hard_fit_summary, soft_fit_summary])

# large dataset
ozs_fit_lg_summary = binomial_ozs_fit_lg.summary(sig_figs=2)
ozs_fit_lg_summary.index = ozs_fit_lg_summary.index.astype(str) + " a) ozs"
ozs_fit_time_lg = binomial_ozs_fit_lg.time
ozs_total_time_lg = 0
for i in range(len(ozs_fit_time_lg)):
    ozs_total_time_lg += ozs_fit_time_lg[i]['total']
ozs_fit_lg_summary['ESS/sec'] = ozs_fit_lg_summary['ESS_bulk'] / ozs_total_time_lg

hard_fit_lg_summary = binomial_hard_fit_lg.summary(sig_figs=2)
hard_fit_lg_summary.index = hard_fit_lg_summary.index.astype(str) + " b) hard"
hard_fit_time_lg = binomial_hard_fit_lg.time
hard_total_time_lg = 0
for i in range(len(hard_fit_time_lg)):
    hard_total_time_lg += hard_fit_time_lg[i]['total']
hard_fit_lg_summary['ESS/sec'] = hard_fit_lg_summary['ESS_bulk'] / hard_total_time_lg

soft_fit_lg_summary = binomial_soft_fit_lg.summary(sig_figs=2)
soft_fit_lg_summary.index = soft_fit_lg_summary.index.astype(str) + " c) soft"
soft_fit_time_lg = binomial_soft_fit_lg.time
soft_total_time_lg = 0
for i in range(len(soft_fit_time_lg)):
    soft_total_time_lg += soft_fit_time_lg[i]['total']
soft_fit_lg_summary['ESS/sec'] = soft_fit_lg_summary['ESS_bulk'] / soft_total_time_lg

large_data_fits_summary = pd.concat([ozs_fit_lg_summary, hard_fit_lg_summary, soft_fit_lg_summary])
All models have R-hat values of 1.00 for all group-level parameters and high effective sample sizes.
Comparison with the true parameters shows that the model recovers the sign of the parameter, but not the exact value. With more data and only a few categories, the model does a better job of recovering the true parameters.
In almost all cases, estimates for each parameter are the same across implementations to 2 significant figures. In a few cases they are off by 0.01; where they are off, the percentage of observations for that parameter is correspondingly low. This is as expected; all three implementations of the sum-to-zero constraint do the same thing; the sum_to_zero_vector implementation is both fast and efficient.
Calibration check
All models contain a generated quantities block, which creates y_rep, the posterior predictive sample. If the model is well-calibrated for the data, we expect that at least 50% of the time the observed value of y will fall in the central 50% interval of the y_rep sample estimates.
sum_to_zero_vector fit y total: 270, ct y is within y_rep central 50% interval: 225, pct: 83.33
Hard sum-to-zero fit y total: 270, ct y is within y_rep central 50% interval: 224, pct: 82.96
Soft sum-to-zero fit y total: 270, ct y is within y_rep central 50% interval: 223, pct: 82.59
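The interval-coverage computation behind these numbers can be sketched as follows; the y and y_rep arrays here are synthetic stand-ins for the observed data and the fit's posterior predictive draws:

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic stand-ins: in the notebook, y_rep comes from the fit's
# generated quantities block and y from the observed data
y = rng.poisson(5.0, size=270)
y_rep = rng.poisson(5.0, size=(4000, 270))

# central 50% interval of the replicates, per observation
lo, hi = np.percentile(y_rep, [25, 75], axis=0)
covered = (y >= lo) & (y <= hi)
print(f"y total: {y.size}, ct y is within y_rep central 50% interval: "
      f"{covered.sum()}, pct: {100 * covered.mean():.2f}")
```

For discrete outcomes, the empirical coverage of a central 50% interval is typically above 50%, because the interval endpoints fall on whole counts.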
Visualize the fit
Plot the distribution of the actual data against the predicted data values for a random sample of replicates. In file utils_dataviz.py we have implemented the equivalent of the R bayesplot package function ppc_dens_overlay.
For the multi-level model, the sum_to_zero_vector provides the fastest wall-clock time for both datasets.
Spatial Models with an ICAR component
Spatial auto-correlation is the tendency for adjacent areas to share similar characteristics. Conditional Auto-Regressive (CAR) and Intrinsic Conditional Auto-Regressive (ICAR) models, first introduced by Besag, 1974, account for this by pooling information from neighboring regions. The BYM model (Besag, York, Mollié, 1991) combines a lognormal Poisson model and an ICAR component for spatial auto-correlation with an ordinary random-effects component for non-spatial heterogeneity. The BYM2 model builds on this model and subsequent refinements.
The data consists of motor vehicle collisions in New York City, as recorded by the NYC Department of Transportation, between the years 2005-2014, restricted to collisions involving school-age children 5-18 years of age as pedestrians. Each crash was localized to the US Census tract in which it occurred, using the 2010 Census block map for New York City. File data/nyc_study.geojson contains the study data and census tract ids and geometry.
The shapefiles from the Census Bureau connect Manhattan to Brooklyn and Queens, but for this analysis, Manhattan is quite separate from Brooklyn and Queens. Getting the data assembled in the order required for our analysis requires data munging, encapsulated in the Python functions in file utils_nyc_map.py. The function nyc_sort_by_comp_size removes any neighbor pairs between tracts in Manhattan and any tracts in Brooklyn or Queens and updates the neighbor graph accordingly. It returns a clean neighbor graph and the corresponding geodataframe, plus a list of the component sizes. The list is sorted so that the largest component (Brooklyn and Queens) is first, and singleton nodes are last.
from utils_nyc_map import nyc_sort_by_comp_size
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)
nyc_comp_sizes
[1360, 329, 271, 108, 22, 2, 1, 1, 1]
To check our work we examine both the geodataframe and the map.
from splot.libpysal import plot_spatial_weights
plot_spatial_weights(nyc_nbs, nyc_gdf)
Model 1: The BYM2 model, Riebler et al. 2016
The key element of the BYM2 model is the ICAR component. Its conditional specification is a multivariate normal random vector \(\mathbf{\phi}\) where each \({\phi}_i\) is conditional on the values of its neighbors.
The joint specification rewrites to a pairwise difference, \[
p(\phi) \propto \exp\Bigl(-\tfrac{1}{2}\sum_{i \sim j} ({\phi}_i - {\phi}_j)^2\Bigr)
\] Each \({({\phi}_i - {\phi}_j)}^2\) contributes a penalty term based on the distance between the values of neighboring regions. However, \(\phi\) is non-identifiable: a constant added to every \({\phi}_i\) washes out of \({\phi}_i - {\phi}_j\). Therefore, a sum-to-zero constraint is needed both to identify and to center \(\phi\).
The Stan implementation of the ICAR component computes the sum of the pairwise distances by representing the spatial adjacency matrix as an array of pairs of neighbor indices.
data {
  ...
  // spatial structure
  int<lower = 0> N_edges;  // number of neighbor pairs
  array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent
The ICAR prior comes into the model as parameter phi.
In this section, we compare three ways of implementing the sum-to-zero constraint on phi.
In model bym2_ozs.stan, phi is declared as a sum_to_zero_vector.
In model bym2_ozs_hard.stan, phi_raw is the unconstrained parameter of size N - 1, and the N-length parameter phi is computed in the transformed parameters block.
In model bym2_soft.stan, phi is declared as an ordinary vector, and the sum-to-zero constraint is combined with the prior:
The ICAR model requires that the neighbor graph is fully connected for two reasons:
The joint distribution is computed from the pairwise differences between a node and its neighbors; singleton nodes have no neighbors and are therefore undefined.
Even if the graph doesn’t have any singleton nodes, when the graph has multiple connected components a sum-to-zero constraint on the entire vector fails to properly identify the model.
Because the BYM2 model includes an ICAR component, it too requires a fully connected neighbor graph. We can either artificially connect the map, or we can analyze the NYC dataset on a per-component basis, starting with the largest component, which encompasses Brooklyn and Queens (excepting the Rockaways).
from libpysal.weights import Queen
brklyn_qns_gdf = nyc_gdf[nyc_gdf['comp_id'] == 0].reset_index(drop=True)
brklyn_qns_nbs = Queen.from_dataframe(brklyn_qns_gdf, geom_col='geometry')
plot_spatial_weights(brklyn_qns_nbs, brklyn_qns_gdf)
print(f'number of components: {brklyn_qns_nbs.n_components}')
print(f'islands? {brklyn_qns_nbs.islands}')
print(f'max number of neighbors per node: {brklyn_qns_nbs.max_neighbors}')
print(f'mean number of neighbors per node: {brklyn_qns_nbs.mean_neighbors}')
number of components: 1
islands? []
max number of neighbors per node: 14
mean number of neighbors per node: 5.977941176470588
Data assembly
The inputs to the BYM2 model are
The Poisson regression data
int<lower=0> N - number of regions
array[N] int<lower=0> y - per-region count outcome
vector<lower=0>[N] E - the population of each region (a.k.a. “exposure”),
int<lower=1> K - the number of predictors
matrix[N, K] xs - the design matrix
The spatial structure
int<lower = 0> N_edges - the number of neighbor pairs
real tau - the scaling factor, introduced in the BYM2
The scaling factor tau was introduced by Riebler et al. so that the variances of the spatial and ordinary random effects are both approximately equal to 1, thus allowing for a straightforward estimate of the relative amounts of spatial and non-spatial variance. We have written a helper function, get_scaling_factor, in file utils_bym2.py, which takes the neighbor graph as its argument and computes the geometric mean of the marginal variances implied by the corresponding adjacency matrix.
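A minimal sketch of the underlying computation, assuming the conventional BYM2 recipe (geometric mean of the marginal variances given by the pseudo-inverse of the graph Laplacian), applied to a hypothetical 4-node path graph; the actual get_scaling_factor helper in utils_bym2.py may differ in details:

```python
import numpy as np

# a hypothetical 4-node path graph, given as neighbor (edge) pairs
edges = [(0, 1), (1, 2), (2, 3)]
N = 4

# graph Laplacian Q: degree on the diagonal, -1 for each neighbor pair
Q = np.zeros((N, N))
for i, j in edges:
    Q[i, i] += 1
    Q[j, j] += 1
    Q[i, j] -= 1
    Q[j, i] -= 1

# the diagonal of the pseudo-inverse gives each node's marginal variance
# under the ICAR prior; the scaling factor is their geometric mean
marginal_vars = np.diag(np.linalg.pinv(Q))
tau = np.exp(np.mean(np.log(marginal_vars)))
print(tau)
```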
Compare the fitted values of spatial component phi, which has a sum-to-zero constraint.
brklyn_qns_phi_ozs = brklyn_qns_ozs_fit.stan_variable('phi')
brklyn_qns_phi_hard = brklyn_qns_hard_fit.stan_variable('phi')
brklyn_qns_phi_soft = brklyn_qns_soft_fit.stan_variable('phi')
print('ozs: marginal mean estimates last elements phi', np.mean(brklyn_qns_phi_ozs, axis=0)[-7:])
print('hard: marginal mean estimates last elements phi', np.mean(brklyn_qns_phi_hard, axis=0)[-7:])
print('soft: marginal mean estimates last elements phi', np.mean(brklyn_qns_phi_soft, axis=0)[-7:])
brklyn_qns_vars_phi_ozs = np.var(brklyn_qns_phi_ozs, axis=0)
print('ozs: marginal variances last elements phi', brklyn_qns_vars_phi_ozs[-7:])
brklyn_qns_vars_phi_hard = np.var(brklyn_qns_phi_hard, axis=0)
print('hard: marginal variances last elements phi', brklyn_qns_vars_phi_hard[-7:])
brklyn_qns_vars_phi_soft = np.var(brklyn_qns_phi_soft, axis=0)
print('soft: marginal variances last elements phi', brklyn_qns_vars_phi_soft[-7:])
ozs: marginal mean estimates last elements phi [-0.95 -0.43 -0.82 -0.69 -0.66 -0.11 -1.36]
hard: marginal mean estimates last elements phi [-0.98 -0.43 -0.84 -0.67 -0.68 -0.11 -1.34]
soft: marginal mean estimates last elements phi [-0.93 -0.43 -0.81 -0.69 -0.67 -0.12 -1.4 ]
ozs: marginal variances last elements phi [0.44 0.3 0.33 0.33 0.28 0.25 0.49]
hard: marginal variances last elements phi [0.42 0.32 0.33 0.33 0.28 0.25 0.49]
soft: marginal variances last elements phi [0.42 0.34 0.33 0.34 0.26 0.24 0.52]
All implementations provide approximately the same marginal variances.
We can repeat this procedure with the next largest component, the Bronx (excepting City Island), which has 329 regions, roughly one quarter of the size of Brooklyn-Queens, with its 1360 regions.
bronx_gdf = nyc_gdf[nyc_gdf['comp_id'] == 1].reset_index(drop=True)
print(f'number of regions: {bronx_gdf.shape[0]}')
bronx_nbs = Queen.from_dataframe(bronx_gdf, geom_col='geometry')
plot_spatial_weights(bronx_nbs, bronx_gdf)
print(f'number of components: {bronx_nbs.n_components}')
print(f'islands? {bronx_nbs.islands}')
print(f'max number of neighbors per node: {bronx_nbs.max_neighbors}')
print(f'mean number of neighbors per node: {bronx_nbs.mean_neighbors}')
number of regions: 329
number of components: 1
islands? []
max number of neighbors per node: 12
mean number of neighbors per node: 5.860182370820668
Compare the fitted values of spatial component phi, which has a sum-to-zero constraint.
bronx_phi_ozs = bronx_ozs_fit.stan_variable('phi')
bronx_phi_hard = bronx_hard_fit.stan_variable('phi')
bronx_phi_soft = bronx_soft_fit.stan_variable('phi')
bronx_vars_phi_ozs = np.var(bronx_phi_ozs, axis=0)
print('ozs: marginal variances last elements phi', bronx_vars_phi_ozs[-5:])
bronx_vars_phi_hard = np.var(bronx_phi_hard, axis=0)
print('hard: marginal variances last elements phi', bronx_vars_phi_hard[-5:])
bronx_vars_phi_soft = np.var(bronx_phi_soft, axis=0)
print('soft: marginal variances last elements phi', bronx_vars_phi_soft[-5:])
boxplot = marginal_variances_boxplot(
    ["1.ozs", "2.hard", "3.soft"],
    [bronx_vars_phi_ozs, bronx_vars_phi_hard, bronx_vars_phi_soft])
boxplot
ozs: marginal variances last elements phi [0.27 0.31 0.27 0.22 0.23]
hard: marginal variances last elements phi [0.26 0.31 0.26 0.21 0.24]
soft: marginal variances last elements phi [0.26 0.31 0.25 0.22 0.25]
Discussion
All implementations return almost identical estimates. The sum_to_zero_vector consistently has the fastest running time. The marginal variances of the spatial component phi are roughly the same across all models, presumably because the ICAR prior is properly constraining the variances.
Model 2: The BYM2_multicomp model, Freni-Sterrantino et al, 2018
In the previous section, we analyzed the New York City data component-wise. This is highly unsatisfactory. In order to apply the BYM2 model to the full NYC dataset, it is necessary to extend the BYM2 model to account for disconnected components and singleton nodes.
Singleton nodes (islands) are given a standard Normal prior
Compute per-connected component scaling factor
Impose a sum-to-zero constraint on each connected component
We have followed these recommendations and implemented this model in Stan. The full implementation details can be found in the notebook The BYM2_multicomp model in Stan.
For this case study, we provide 2 implementations of the BYM2_multicomp model: one which uses the sum_to_zero_vector and one which implements the soft sum-to-zero constraint.
It is necessary to constrain the elements of the spatial effects vector phi on a component-by-component basis. Stan’s slicing with range indexes provides a way to efficiently access each component. The helper function nyc_sort_by_comp_size both sorts the study data by component and adds the component index to the geodataframe.
In the BYM2 model for a fully connected graph the sum-to-zero constraint on phi is implemented directly by declaring phi to be a sum_to_zero_vector, which is a constrained parameter type. The declaration:
sum_to_zero_vector[N] phi; // spatial effects
creates a constrained variable of length \(N\), with a corresponding unconstrained variable of length \(N-1\).
In order to constrain slices of the parameter vector phi, we do the following:
In the parameters block, we declare the unconstrained parameter phi_raw as a regular vector (instead of a sum_to_zero_vector).
For a fully connected graph of size \(N\), the size of the unconstrained sum-to-zero vector is \(N-1\). For a disconnected graph with \(M\) non-singleton nodes, the size of phi_raw is \(M\) minus the number of connected components.
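Using the component sizes reported earlier for the NYC neighbor graph, the arithmetic works out as follows:

```python
# component sizes from the NYC neighbor graph (singletons last)
comp_sizes = [1360, 329, 271, 108, 22, 2, 1, 1, 1]

non_singleton = [s for s in comp_sizes if s > 1]
M = sum(non_singleton)         # non-singleton nodes
free = M - len(non_singleton)  # one sum-to-zero constraint per component
print(M, free)  # 2092 2086
```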
The constraining transform is broken into two functions:
function zero_sum_constrain_lp, the actual constraining transform, which corresponds directly to the built-in sum_to_zero_vector transform.
function zero_sum_components_lp, which handles the slicing and calls zero_sum_constrain_lp on each component.
/**
 * Constrain sum-to-zero vector
 *
 * @param y unconstrained zero-sum parameters
 * @return vector z, the vector whose slices sum to zero
 */
vector zero_sum_constrain_lp(vector y) {
  int N = num_elements(y);
  vector[N + 1] z = zeros_vector(N + 1);
  real sum_w = 0;
  for (ii in 1:N) {
    int i = N - ii + 1;
    real n = i;
    real w = y[i] * inv_sqrt(n * (n + 1));
    sum_w += w;
    z[i] += sum_w;
    z[i + 1] -= w * n;
  }
  return z;
}
zero_sum_components_lp: slices vector phi by component, applies constraining transform to each.
/**
 * Component-wise constrain sum-to-zero vectors
 *
 * @param phi unconstrained vector of zero-sum slices
 * @param idxs component start and end indices
 * @param sizes component sizes
 * @return vector phi_ozs, the vector whose slices sum to zero
 */
vector zero_sum_components_lp(vector phi,
                              array[ , ] int idxs,
                              array[] int sizes) {
  vector[sum(sizes)] phi_ozs;
  int idx_phi = 1;
  int idx_ozs = 1;
  for (i in 1:size(sizes)) {
    phi_ozs[idx_ozs : idx_ozs + sizes[i] - 1]
      = zero_sum_constrain_lp(segment(phi, idx_phi, sizes[i] - 1));
    idx_phi += sizes[i] - 1;
    idx_ozs += sizes[i];
  }
  return phi_ozs;
}
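To see that these transforms do what they claim, here is a direct NumPy port of the two Stan functions above, applied to hypothetical component sizes; each constrained slice sums to (numerically) zero:

```python
import numpy as np

def zero_sum_constrain(y):
    """Map an unconstrained vector of length N to a zero-sum vector of
    length N + 1 (port of the Stan function zero_sum_constrain_lp)."""
    N = len(y)
    z = np.zeros(N + 1)
    sum_w = 0.0
    for ii in range(1, N + 1):
        i = N - ii + 1  # 1-based index, walking backwards as in Stan
        w = y[i - 1] / np.sqrt(i * (i + 1))
        sum_w += w
        z[i - 1] += sum_w
        z[i] -= w * i
    return z

def zero_sum_components(phi, sizes):
    """Apply the transform slice by slice, one slice per component."""
    out = np.empty(sum(sizes))
    idx_phi, idx_out = 0, 0
    for s in sizes:
        out[idx_out:idx_out + s] = zero_sum_constrain(phi[idx_phi:idx_phi + s - 1])
        idx_phi += s - 1
        idx_out += s
    return out

rng = np.random.default_rng(3)
sizes = [5, 3, 4]                 # hypothetical component sizes
phi_raw = rng.normal(size=sum(sizes) - len(sizes))
phi = zero_sum_components(phi_raw, sizes)
# each component slice sums to (numerically) zero
print(phi[:5].sum(), phi[5:8].sum(), phi[8:].sum())
```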
Data Assembly
The helper function nyc_sort_by_comp_size adds component info to the geodataframe. It also returns the neighbor graph over the full dataset, plus a list of component sizes.
from utils_nyc_map import nyc_sort_by_comp_size
from utils_bym2 import nbs_to_adjlist, get_scaling_factors
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)

# design matrix
design_vars = np.array(['pct_pubtransit', 'med_hh_inc', 'traffic', 'frag_index'])
design_mat = nyc_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])

# spatial structure
nyc_nbs_adj = nbs_to_adjlist(nyc_nbs)
component_sizes = [x for x in nyc_comp_sizes if x > 1]
scaling_factors = get_scaling_factors(len(component_sizes), nyc_gdf)
We assemble all inputs into dictionary bym2_multicomp_data.
bym2_multicomp_ozs_fit = bym2_multicomp_ozs_mod.sample(data=bym2_multicomp_data, iter_warmup=3000)
print("sum_to_zero_vector: runtimes in seconds, all chains")
bym2_multicomp_ozs_fit.time
11:00:34 - cmdstanpy - INFO - CmdStan start processing
11:01:50 - cmdstanpy - INFO - CmdStan done processing.
sum_to_zero_vector: runtimes in seconds, all chains
We can compare the sum_to_zero_vector implementation to an implementation of the soft sum-to-zero constraint. In the bym2_soft.stan model, the soft sum-to-zero constraint is combined directly with the ICAR prior:
For the BYM2_multicomp model, split this into two parts. The ICAR prior is applied to phi, and then we iterate through the components, applying the sum-to-zero constraint to each in turn.
The BYM2 model has more data and a relatively complex multilevel structure. Before Stan 2.36, for this model and dataset, the soft sum-to-zero constraint was much faster than the hard sum-to-zero constraint. Here we show that the sum_to_zero_vector greatly improves the run time.
For the BYM2_multicomp model, the soft sum-to-zero implementation is painfully slow. Both implementations get essentially the same estimates, which simply confirms that both models are correctly implemented.
Although the ESS_bulk numbers for both implementations are roughly comparable, the ESS/sec for the sum_to_zero_vector implementation is over 12 times higher.
Conclusion: the sum_to_zero_vector just works!
The more complex the model, the greater the need for the sum_to_zero_vector.